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Abstract 

Automatic  recognition  of  Arabic  dialectal  speech  is 
a  challenging  task  because  Arabic  dialects  are  es¬ 
sentially  spoken  varieties.  Only  few  dialectal  re¬ 
sources  are  available  to  date;  moreover,  most  avail¬ 
able  acoustic  data  collections  are  transcribed  with¬ 
out  diacritics.  Such  a  transcription  omits  essen¬ 
tial  pronunciation  information  about  a  word,  such 
as  short  vowels.  In  this  paper  we  investigate  var¬ 
ious  procedures  that  enable  us  to  use  such  train¬ 
ing  data  by  automatically  inserting  the  missing  dia¬ 
critics  into  the  transcription.  These  procedures  use 
acoustic  information  in  combination  with  different 
levels  of  morphological  and  contextual  constraints. 
We  evaluate  their  performance  against  manually  dia- 
critized  transcriptions.  In  addition,  we  demonstrate 
the  effect  of  their  accuracy  on  the  recognition  perfor¬ 
mance  of  acoustic  models  trained  on  automatically 
diacritized  training  data. 

1  Introduction 

Large-vocabulary  automatic  speech  recognition 
(ASR)  for  conversational  Arabic  poses  several 
challenges  for  the  speech  research  community. 
The  most  difficult  problems  in  developing  highly 
accurate  speech  recognition  systems  for  Arabic 
are  the  predominance  of  non-diacritized  text 
material,  the  enormous  dialectal  variety,  and 
the  morphological  complexity. 

Most  available  acoustic  training  material  for 
Arabic  ASR  is  transcribed  in  the  Arabic  script 
form,  which  does  not  include  short  vowels  and 
other  diacritics  that  reflect  differences  in  pro¬ 
nunciation,  such  as  the  shadda,  tanween,  etc.  In 
particular,  almost  all  additional  text  data  that 
can  easily  be  obtained  (e.g.  broadcast  news  cor¬ 
pora)  is  represented  in  standard  script  form.  To 
our  knowledge,  the  only  available  corpus  that 
does  include  detailed  phonetic  information  is 
the  CallHome  (CH)  Egyptian  Colloquial  Ara¬ 
bic  (EGA)  corpus  distributed  by  the  Linguis¬ 
tic  Data  Consortium  (LDC).  This  corpus  has 
been  transcribed  in  both  the  script  form  and 


Katrin  Kirchhoff 

Department  of  Electrical  Engineering, 
University  of  Washington, 
Seattle,  WA  98195,  USA 
katrin@ee.washington.edu 


a  so-called  romanized  form,  which  is  an  ASCII 
representation  that  includes  short  vowels  and 
other  diacritic  information  and  thus  has  com¬ 
plete  pronunciation  information.  It  is  quite 
challenging  to  create  such  a  transcription:  na¬ 
tive  speakers  of  Arabic  are  not  used  to  writing 
their  language  in  a  ’’romanized”  form,  or  even  in 
fully  diacritized  script  form.  Consequently,  this 
task  is  considered  almost  as  difficult  as  phonetic 
transcription.  Transcribing  a  sufficiently  large 
amount  of  training  data  in  this  way  is  there¬ 
fore  labor-intensive  and  costly  since  it  involves 
(re)-training  native  speakers  for  this  purpose. 

The  constraint  of  having  mostly  non- 
diacritized  texts  as  recognizer  training  material 
leads  to  problems  for  both  acoustic  and  lan¬ 
guage  modeling.  Eirst,  it  is  difficult  to  train 
accurate  acoustic  models  for  short  vowels  if 
their  identity  and  location  in  the  signal  is  not 
known.  Second,  the  absence  of  diacritics  leads 
to  a  larger  set  of  linguistic  contexts  for  a  given 
word  form;  language  models  trained  on  non- 
diacritized  material  may  therefore  be  less  pre¬ 
dictive  than  those  trained  on  diacritized  texts. 
Both  of  these  factors  may  lead  to  a  loss  in 
recognition  accuracy.  Previous  work  (Kirchhoff 
et  ah,  2002;  Lamel,  2003)  has  shown  that  ig¬ 
noring  available  vowel  information  does  indeed 
lead  to  a  significant  increase  in  both  language 
model  perplexity  and  word  error  rate.  There¬ 
fore,  we  are  interested  in  automatically  deriv¬ 
ing  a  diacritized  transcription  from  the  Arabic 
script  representation  when  a  manual  diacritiza¬ 
tion  is  not  available.  Some  software  companies 
(Sakhr,  Apptek,  RDI)  have  developed  commer¬ 
cial  products  for  the  automatic  diacritization  of 
Arabic.  However,  these  products  use  only  text- 
based  information,  such  as  the  syntactic  context 
and  possible  morphological  analyses  of  words,  to 
predict  diacritics.  In  the  context  of  diacritiza¬ 
tion  for  speech  recognition,  by  contrast,  acous¬ 
tic  data  is  available  that  can  be  used  as  an  ad¬ 
ditional  knowledge  source.  Moreover,  commer- 
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cial  products  concentrate  exclusively  on  Modern 
Standard  Arabic  (MSA) ,  whereas  a  common  ob¬ 
jective  of  Arabic  ASR  is  conversational  speech 
recognition,  which  is  usually  dialectal.  For  this 
reason,  a  more  flexible  set  of  tools  is  required 
in  order  to  diacritize  dialectal  Arabic  prior  to 
speech  recognizer  training. 

In  this  work  we  investigate  the  relative  ben¬ 
efits  of  a  variety  of  knowledge  sources  (acous¬ 
tic,  morphological,  and  contextual)  to  automat¬ 
ically  diacritize  MSA  transcriptions.  We  eval¬ 
uate  the  different  approaches  in  two  different 
ways:  (a)  by  comparing  the  automatic  output 
against  a  manual  reference  diacritization  and 
computing  the  diacritization  error  rate,  and  (b) 
by  using  automatically  diacritized  training  data 
in  a  cross-dialectal  speech  recognition  applica¬ 
tion. 

The  remainder  of  this  paper  is  structured  as 
follows:  Section  2  gives  a  detailed  description  of 
the  motivation  as  well  as  prior  work.  Section  3 
describes  the  corpora  used  for  the  experiments 
reported  in  this  paper.  The  automatic  diacriti¬ 
zation  procedures  and  results  are  explained  in 
Section  4.  The  speech  recognition  experiments 
and  results  are  reported  in  Section  5.  Section  6 
presents  our  conclusions. 

2  Motivation  and  Prior  Work 

We  first  describe  the  Arabic  writing  system 
and  its  inherent  problems  for  speech  recognizer 
training,  and  then  discuss  previous  attempts  at 
automatic  diacritization. 

2.1  The  Arabic  Writing  System 

The  Arabic  alphabet  consists  of  twenty-eight 
letters,  twenty- five  of  which  represent  conso¬ 
nants  and  three  of  which  represent  the  long 
vowels  (/i:/,/a:/,/u:/).  A  distinguishing  fea¬ 
ture  of  Arabic-script  based  writing  systems  is 
that  short  vowels  are  not  represented  by  the 
letters  of  the  alphabet.  Instead,  they  are 
marked  by  so-called  diacritics,  short  strokes 
placed  either  above  or  below  the  preceding  con¬ 
sonant.  Several  other  pronunciation  phenom¬ 
ena  are  marked  by  diacritics,  such  as  consonant 
doubling  (phonemic  in  Arabic),  which  is  indi¬ 
cated  by  the  “shadda”  sign,  and  the  “tanween” , 
i.e.  word-final  adverbial  markers  that  add  /n/  to 
the  pronunciation  of  the  word.  These  diacritics 
are  listed  in  Table  1.  Arabic  texts  are  almost 
never  fully  diacritized;  normally,  diacritics  are 
used  sparingly  and  only  to  prevent  misunder¬ 
standings.  Exceptions  are  important  religious 
and/or  political  texts  or  beginners’  texts  for 


MSA 

Symbol  Name 

Meaning 

1 

fatHa 

/a/ 

1 

kasra 

N 

1 

Damma 

N 

j 

shadda 

consonant  doubling 

sukuun 

vowel  absence 

1 

tanween  al-fatHa 

/an/ 

1 

tanween  al-kasr 

/in/ 

\ 

tanween  aD-Damm 

/un/ 

Table  1:  Arabic  diacritics 


students  of  Arabic.  The  lack  of  diacritics  may 
lead  to  considerable  lexical  ambiguity  that  must 
be  resolved  by  contextual  information,  which 
in  turn  presupposes  knowledge  of  the  language. 
It  was  observed  in  (Debili  et  ah,  2002)  that 
a  non-diacritized  dictionary  word  form  has  2.9 
possible  diacritized  forms  on  average  and  that 
an  Arabic  text  containing  23,000  word  forms 
showed  an  average  ratio  of  1:11.6.  The  form 
for  instance,  has  21  possible  diacritiza- 

tions.  The  correspondence  between  graphemes 
and  phonemes  is  relatively  transparent  com¬ 
pared  to  other  languages  like  English  or  Erench: 
apart  from  certain  special  graphemes  (e.g.  laam 
alif),  the  relationship  is  one  to  one.  Einally, 
it  is  worth  noting  that  the  writing  system  de¬ 
scribed  above  is  that  of  MSA.  Arabic  dialects 
are  primarily  oral  varieties  in  that  they  do  not 
have  generally  agreed-upon  writing  standards. 
Whenever  there  is  the  need  to  write  down  di¬ 
alectal  speech,  speakers  will  try  to  approximate 
the  standard  system  as  far  as  possible  and  use  a 
phonetic  spelling  for  non-MSA  or  foreign  words. 

The  lack  of  diacritics  in  standard  Arabic  texts 
makes  it  difficult  to  use  non-diacritized  text  for 
training  since  the  location  and  identity  of  short 
vowels  and  other  phonetic  segments  are  un¬ 
known.  One  possible  approach  is  to  use  acous¬ 
tic  models  for  long  vowels  and  consonants  only, 
where  the  acoustic  signal  portions  correspond¬ 
ing  to  unwritten  segments  are  implicitly  incor¬ 
porated  into  the  acoustic  models  for  consonants 
(Billa  et  al,  2002).  However,  this  leads  to  less 
discriminative  acoustic  and  language  models. 
Previous  work  (Kirchhoff  et  ah,  2002;  Lamel, 
2003)  has  compared  the  word  error  rates  of 
two  CH  EGA  recognizers:  one  trained  on  script 
transcriptions  and  another  trained  on  roman- 
ized  transcriptions.  It  was  shown  that  the  loss 
in  information  due  to  training  on  script  forms 


results  in  significantly  worse  performance:  a  rel¬ 
ative  increase  in  word  error  rate  of  almost  10% 
was  observed. 

It  seems  clear  that  diacritized  data  should  be 
used  for  training  Arabic  ASR  systems  whenever 
possible.  As  explained  above,  however,  it  is  very 
expensive  to  obtain  manually  transcribed  data 
in  a  diacritized  form.  Therefore,  the  corpora 
that  do  include  detailed  transcriptions  are  fairly 
small  and  any  dialectal  data  that  might  become 
available  in  the  future  will  also  very  likely  be 
of  limited  size.  By  contrast,  it  is  much  easier 
to  collect  publicly  available  data  (e.g.  broadcast 
news  data)  and  to  transcribe  it  in  script  form. 
In  order  to  be  able  to  take  advantage  of  such 
resources,  we  need  to  restore  short  vowels  and 
other  missing  diacritics  in  the  transcription. 

2.2  Prior  Work 

Various  software  companies  have  developed 
automatic  diacritization  products  for  Arabic. 
However,  all  of  these  are  targeted  towards  MSA; 
to  our  knowledge,  there  are  no  products  for  di¬ 
alectal  Arabic.  In  a  previous  study  (Kirchhoff 
et  ah,  2002)  one  of  these  products  was  tested 
on  three  different  texts,  two  MSA  texts  and  one 
EGA  text.  It  was  found  that  the  diacritization 
error  rate  (percentage  of  missing  and  wrongly 
identified  or  inserted  diacritics)  on  MSA  ranged 
between  9%  and  28%,  depending  on  whether  or 
not  case  vowel  endings  were  counted.  However, 
on  the  EGA  text,  the  diacritization  software  ob¬ 
tained  an  error  rate  of  48%. 

A  fully  automatic  approach  to  diacritization 
was  presented  in  (Gal,  2002),  where  an  HMM- 
based  bigram  model  was  used  for  decoding 
diacritized  sentences  from  non-diacritized  sen¬ 
tences.  The  technique  was  applied  to  the  Quran 
and  achieved  14%  word  error  (incorrectly  dia¬ 
critized  words). 

A  first  attempt  at  developing  an  automatic 
diacritizer  for  dialectal  speech  was  reported  in 
(Kirchhoff  et  ah,  2002).  The  basic  approach 
was  to  use  a  small  set  of  parallel  script  and  dia¬ 
critized  data  (obtained  from  the  EGA  GallHome 
corpus)  and  to  derive  diacritization  rules  in  an 
example-based  way.  This  entirely  knowledge- 
free  approach  achieved  a  16.6%  word  error  rate. 

Other  studies  (El-Imam,  2003)  have  ad¬ 
dressed  problems  of  grapheme-to-phoneme  con¬ 
version  in  Arabic,  e.g.  for  the  purpose  of  speech 
synthesis,  but  have  assumed  that  a  fully  dia¬ 
critized  version  of  the  text  is  already  available. 

Several  knowledge  sources  are  available  for 


determining  the  most  appropriate  diacritization 
of  a  script  form:  analysis  of  the  morphological 
structure  of  the  word  (including  segmentation 
into  stems,  prefixes,  roots  and  patterns),  con¬ 
sideration  of  the  syntactic  context  in  which  the 
word  form  appears,  and,  in  the  context  of  speech 
recognition,  the  acoustic  data  that  accompanies 
the  transcription.  Specific  dictionary  informa¬ 
tion  could  in  principle  be  added  (such  as  infor¬ 
mation  about  proper  names),  but  this  knowl¬ 
edge  source  is  ignored  for  the  purpose  of  this 
study.  All  of  the  approaches  described  above 
make  use  of  text-based  information  only  and  do 
not  attempt  to  use  acoustic  information. 

3  Data 

Eor  the  present  study  we  used  two  different  cor¬ 
pora,  the  EBIS  corpus  of  MSA  speech  and  the 
LDG  GallHome  EGA  corpus. 

The  EBIS  corpus  is  a  collection  of  radio  news¬ 
casts  from  various  radio  stations  in  the  Ara¬ 
bic  speaking  world  (Gairo,  Damascus,  Bagh¬ 
dad)  totaling  approximately  40  hours  of  speech 
(roughly  240K  words).  The  transcription  of  the 
EBIS  corpus  was  done  in  Arabic  script  only 
and  does  not  contain  any  diacritic  information. 
There  were  a  total  of  54K  different  script  forms, 
with  an  average  of  2.5  different  diacritizations 
per  word. 

The  GallHome  corpus,  made  available  by 
LDG,  consists  of  informal  telephone  conversa¬ 
tions  between  native  speakers  (friends  and  fam¬ 
ily  members)  of  Egyptian  Arabic,  mostly  from 
the  Gairene  dialect  region.  The  corpus  con¬ 
sists  of  about  20  hours  of  training  data  (roughly 
160K  words)  and  6  hours  of  test  data.  It  is  tran¬ 
scribed  in  two  different  ways:  (a)  using  stan¬ 
dard  Arabic  script,  and  (b)  using  a  romaniza- 
tion  scheme  developed  at  LDG  and  distributed 
with  the  corpus.  The  romanized  transcription 
contains  short  vowels  and  phonetic  segments 
corresponding  to  other  diacritics.  It  is  not  en¬ 
tirely  equivalent  to  a  diacritized  Arabic  script 
representation  since  it  includes  additional  in¬ 
formation.  Eor  instance,  symbols  particular  to 
Egyptian  Arabic  were  used  (e.g.  ”g”  for  /g/, 
the  EGA  pronunciation  of  the  MSA  letter  ^), 

whereas  the  script  transcriptions  contain  MSA 
letters  only.  In  general,  the  romanized  tran¬ 
scription  provides  more  information  about  ac¬ 
tual  pronunciation  and  is  thus  closer  to  a  broad 
phonetic  transcription. 


4  Automatic  Diacritization 

We  describe  three  techniques  for  the  automatic 
diacritization  of  Arabic  text  data.  The  first 
combines  acoustic,  morphological  and  contex¬ 
tual  information  to  predict  the  correct  form,  the 
second  ignores  contextual  information,  and  the 
third  is  fully  acoustics  based.  The  latter  tech¬ 
nique  uses  no  morphological  or  syntactic  con¬ 
straints,  and  allows  for  all  possible  items  to  be 
inserted  at  every  possible  position. 

4.1  Combination  of  Acoustic, 

Morphological  and  Contextual 
Information 

Most  Arabic  script  forms  can  have  a  number 
of  possible  morphological  interpretations,  which 
often  correspond  to  different  diacritized  forms. 
Our  goal  is  to  combine  morphological  knowledge 
with  contextual  information  in  order  to  identify 
possible  diacritizations  and  assign  probabilities 
to  them.  Our  procedure  is  as  follows: 

1.  Generate  all  possible  diacritized  variants 
for  each  word,  along  with  their  morphological 
analyses  (tags). 

2.  Train  an  unsupervised  tagger  to  assign 
probabilities  to  sequences  of  these  morpholog¬ 
ical  tags. 

3.  Use  the  trained  tagger  to  assign  proba¬ 
bilities  to  all  possible  diacritizations  for  a  given 
utterance. 

For  the  first  step  we  used  the  Buckwalter 
stemmer,  which  is  an  Arabic  morphological 
analysis  tool  available  from  the  LDC.  The  stem¬ 
mer  produces  all  possible  morphological  anal¬ 
yses  of  a  given  Arabic  script  form;  as  a  by¬ 
product  it  also  outputs  the  concomitant  dia¬ 
critized  word  forms.  An  example  of  the  output 
is  shown  in  Figure  1.  The  next  step  was  to  train 
an  unsupervised  tagger  on  the  output  to  obtain 
tag  n-gram  probabilities.  The  number  of  differ¬ 
ent  morphological  tags  generated  by  applying 
the  stemmer  to  the  FBIS  text  was  763.  In  or¬ 
der  to  obtain  a  smaller  tag  set  and  to  be  able 
to  estimate  probabilities  for  tag  sequences  more 
robustly,  this  initial  tag  needed  to  be  conflated 
to  a  smaller  set.  We  adopted  the  set  used  in 
the  LDC  Arabic  TreeBank  project,  which  was 
also  developed  based  on  the  Buckwalter  mor¬ 
phological  analysis  scheme.  The  FBIS  tags  were 
mapped  to  TreeBank  tags  using  longest  com¬ 
mon  substring  matching;  this  resulted  in  392 
tags.  Further  possible  reductions  of  the  tag 
set  were  investigated  but  it  was  found  that  too 
much  clustering  (e.g.  of  verb  subclasses  into  a 


LOOK-UP  WORD:  JJ  (qbl) 

SOLUTION  1:  (qabola)  qabola/PREP 
(GLOSS):  -h  before  -h 
SOLUTION  2:  (qaboli)  qaboli/PREP 
(GLOSS):  -h  before  -h 
SOLUTION  3:  (qabolu)  qabolu/ADV 
(GLOSS):  -I-  before/prior  -|- 
SOLUTION  4:(qibal)  qibal/NOUN 
(GLOSS):  -I-  (on  the)  part  of  -I- 
SOLUTION  5:(qabila) 

qabil/VERBT’ERFEGT-ha/PVSUFF_SUBJ:3MS 
(GLOSS):  -I-  accept/receive/approve  -I-  he/it  <verb> 
SOLUTION  6:  (qab'ala) 

qab  al/VERBT’ERFEGT-ha/PVSUFF_SUBJ:3MS 
(GLOSS):  -I-  kiss  -I-  he/it  <verb> 

Figure  1:  Sample  output  of  Buckwalter  stem¬ 
mer  showing  the  possible  diacritizations  and 
morphological  analyses  of  the  script  form  J.J 

(qbl).  Lower-case  o  stands  for  sukuun  (lack  of 
vowel) . 

single  verb  class)  could  result  in  the  loss  of  im¬ 
portant  information.  For  instance,  the  tense 
and  voice  features  of  verbs  are  strong  predictors 
of  the  short  vowel  patterns  and  should  therefore 
be  preserved  in  the  tagset. 

We  adopted  a  standard  statistical  trigram 
tagging  model: 

P{to,  .  .  .,tn\wo,  ...,Wn)  = 

n 

Y[P{Wi\ti)P{ti\ti-i,ti-2)  (1) 

i=0 

where  t  is  a  tag,  w  is  a  word,  and  n  is  the  to¬ 
tal  number  of  words  in  the  sentence.  In  this 
model,  words  (i.e.  non-diacritized  script  forms) 
and  morphological  tags  are  treated  as  observed 
random  variables  during  training.  Training  is 
done  in  an  unsupervised  way,  i.e.  the  correct 
morphological  tag  assignment  for  each  word  is 
not  known.  Instead,  all  possible  assignments 
are  initially  considered  and  the  Expectation- 
Maximization  (EM)  training  procedure  itera¬ 
tively  trains  the  probability  distributions  in  the 
above  model  (the  probability  of  word  given 
tag,  P{wi\ti),  and  the  tag  sequence  probabil¬ 
ity,  P{ti\ti-i,ti-2))  until  convergence.  During 
testing,  only  the  word  sequence  is  known  and 
the  best  tag  assignment  is  found  by  maximiz¬ 
ing  the  probability  in  Equation  1.  We  used  the 
graphical  modeling  toolkit  GMTK  (Bilmes  and 
Zweig,  2002)  to  train  the  tagger.  The  trained 
tagger  was  then  used  to  assign  probabilities  to 
all  possible  sequences  of  three  successive  mor- 


phological  tags  and  their  associated  diacritiza- 
tions  to  all  utterances  in  the  FBIS  corpus. 

Using  the  resulting  possible  diacritizations 
for  each  utterance  we  constructed  a  word- 
pronunciation  network  with  the  probability 
scores  assigned  by  the  tagger  acting  as  transi¬ 
tion  weights.  These  word  networks  were  used 
as  constraining  recognition  networks  with  the 
acoustic  models  trained  on  the  CallHome  cor¬ 
pus  to  find  the  most  likely  word  sequence  (a 
process  called  alignment).  We  performed  this 
procedure  with  different  weights  on  the  tagger 
probabilities  to  see  how  much  this  information 
should  be  weighted  compared  to  the  acoustic 
scores.  Results  for  weights  1  and  5  are  reported 
below. 

Since  the  Buckwalter  stemmer  does  not  pro¬ 
duce  case  endings,  the  word  forms  obtained 
by  adding  case  endings  were  included  as  vari¬ 
ants  in  the  pronunciation  dictionary  used  by  the 
aligner.  Additional  variants  listed  in  the  dictio¬ 
nary  are  the  taa  marbuta  alternations  /a/  and 
/at/.  In  some  cases  (approximately  1.5%  of  all 
words)  the  Buckwalter  stemmer  was  not  able  to 
produce  an  analysis  of  the  word  form  due  to  mis¬ 
spellings  or  novel  words.  These  were  mapped  to 
a  generic  reject  model. 

4.2  Combination  of  Acoustic  and 
Morphological  Constraints 

We  were  interested  in  separately  evaluating  the 
usefulness  of  the  probabilistic  contextual  knowl¬ 
edge  provided  by  the  tagger,  and  the  morpho¬ 
logical  knowledge  contributed  by  the  Buckwal¬ 
ter  tool.  To  that  end  we  used  the  word  networks 
produced  by  the  method  described  above  but 
stripped  the  tagger  probabilities,  thus  assigning 
uniform  probability  to  all  diacritized  forms  pro¬ 
duced  by  the  morphological  analyzer.  We  used 
the  same  acoustic  models  to  find  the  most  likely 
alignment  from  the  word  networks. 

4.3  Using  only  Acoustic  Information 

Similarly,  we  wanted  to  evaluate  the  importance 
of  using  morphological  information  versus  only 
acoustic  information  to  constrain  the  possible 
diacritizations.  This  is  particularly  interesting 
since,  as  new  dialectal  speech  data  become  avail¬ 
able,  the  acoustics  may  be  the  only  informa¬ 
tion  source.  As  explained  above,  existing  mor¬ 
phological  analysis  tools  such  as  the  Buckwalter 
stemmer  have  been  developed  for  MSA  only. 

For  that  purpose,  we  generated  word  net¬ 
works  that  include  all  possible  short  vowels  at 
each  allowed  position  in  the  word  and  allowed 


all  possible  case  endings.  This  means  that  af¬ 
ter  every  consonant  there  are  at  least  5  dif¬ 
ferent  choices:  no  vowel  (corresponding  to  the 
sukuun  diacritic),  /i/,  /a/,  /u/,  or  consonant 
doubling  caused  by  a  shadda  sign.  Combina¬ 
tions  of  shadda  and  a  short  vowel  are  also  pos¬ 
sible.  Since  we  do  not  use  acoustic  models  for 
doubled  consonants  in  our  speech  recognizer,  we 
ignore  the  variants  involving  shadda  and  allow 
only  four  possibilities  after  every  word-medial 
consonant:  the  three  short  vowels  or  absence  of 
a  vowel.  Finally,  we  include  the  three  tanween 
endings  in  addition  to  these  four  possibilities  in 
word-final  position.  As  before,  the  taa  marbuta 
variants  are  also  included. 

In  this  way,  many  more  possible  “pronuncia¬ 
tions”  are  generated  for  a  script  form  than  could 
ever  occur.  The  number  of  possible  variants  in¬ 
creases  exponentially  with  the  number  of  pos¬ 
sible  vowel  slots  in  the  word.  For  instance,  for 
a  longer  word  with  7  possible  positions,  more 
than  16K  diacritized  forms  are  possible,  not 
even  counting  the  possible  word  endings.  As  be¬ 
fore,  we  use  these  large  pronunciation  networks 
to  constrain  our  alignment  with  acoustic  models 
trained  on  CallHome  data  and  choose  the  most 
likely  path  as  the  output  diacritization. 

In  principle  it  would  also  be  possible  to  deter¬ 
mine  diacritization  performance  in  the  absence 
of  acoustic  information,  using  only  morphologi¬ 
cal  and  contextual  knowledge.  This  can  be  done 
by  selecting  the  best  path  from  the  weighted 
word  transition  networks  without  rescoring  the 
network  with  acoustic  models.  However,  this 
would  not  lead  to  a  valid  comparison  in  our  case 
because  case  endings  are  only  represented  in  the 
pronunciation  dictionary  used  by  the  acoustic 
aligner;  they  are  not  present  in  the  weighted 
transition  network  and  thus  cannot  be  hypoth¬ 
esized  unless  the  acoustic  aligner  is  used. 

4.4  Autodiacritization  Error  Rates 

We  measured  the  performance  of  all  three  meth¬ 
ods  by  comparing  the  output  against  hand  tran¬ 
scribed  references  on  a  500  word  subset  of  the 
FBIS  corpus.  These  references  were  fully  dia¬ 
critized  script  transcriptions  created  by  a  na¬ 
tive  speaker  of  Arabic  who  was  trained  in  or¬ 
thographic  transcription  but  not  in  phonetic 
transcription.  The  diacritization  error  rate  was 
measured  as  the  percentage  of  wrong  diacritiza¬ 
tion  decisions  out  of  all  possible  decisions.  In 
particular,  an  error  occurs  when: 

•  a  vowel  is  inserted  although  the  reference 


transcription  shows  either  sukuun  or  no  dia¬ 
critic  mark  at  the  corresponding  position  (in¬ 
sertion). 

•  no  vowel  is  produced  by  the  automatic  pro¬ 
cedure  but  the  reference  contains  a  vowel  mark 
at  the  corresponding  position  (deletion) . 

•  the  short  vowel  inserted  does  not  match  the 
vowel  at  the  corresponding  position  (substitu¬ 
tion). 

•  in  the  case  of  tanween  and  taa  marbuta  end¬ 
ings,  either  the  required  consonants  or  vowels 
are  missing  or  wrongly  inserted.  Thus,  in  the 
case  of  a  taa  marbuwta  ending  with  a  following 
case  vowel  /i/,  for  instance,  both  the  /t/  and 
the  /i/  need  to  be  present.  If  either  is  missing, 
one  error  is  assigned;  if  both  are  missing,  two 
errors  are  assigned. 

Results  are  listed  in  Table  2.  The  first  column 
reports  the  error  rate  at  the  word  level,  i.e.  the 
percentage  of  words  that  contained  at  least  one 
diacritization  mistake.  The  second  column  lists 
the  diacritization  error  computed  as  explained 
above.  The  first  three  methods  have  a  very  sim¬ 
ilar  performance  with  respect  to  diacritization 
error  rate.  The  use  of  contextual  information 
(the  tagger  probabilities)  gives  a  slight  advan¬ 
tage,  although  the  difference  is  not  statistically 
significant.  Despite  these  small  differences,  the 
word  error  rate  is  the  same  for  all  three  meth¬ 
ods;  this  is  because  a  word  that  contains  at  least 
one  mistake  is  counted  as  a  word  error,  regard¬ 
less  of  the  total  number  of  mistakes  in  the  word, 
which  may  vary  from  system  to  system.  Using 
only  acoustic  information  doubles  the  diacriti¬ 
zation  error  rate  and  increases  the  word  error 
rate  to  50%.  Errors  result  mostly  from  incorrect 
insertions  of  vowels  (e.g.  .sljjy  ^  .sljJij).  Many 

of  these  insertions  may  stem  from  acoustic  ef¬ 
fects  created  by  neighbouring  consonants,  that 
give  a  vowel-like  quality  to  transitions  between 
consonants.  The  main  benefit  of  using  morpho¬ 
logical  knowledge  lies  in  the  prevention  of  such 
spurious  vowel  insertions,  since  only  those  inser¬ 
tions  are  permitted  which  result  in  valid  words. 
Even  without  the  use  of  morphological  infor¬ 
mation,  the  vast  majority  of  the  missing  vowels 
are  still  identified  correctly.  Thus,  this  method 
might  be  of  use  when  diacritizing  a  variety  of 
Arabic  for  which  morphological  analysis  tools 
are  not  available.  Note  that  the  results  obtained 
here  are  not  directly  comparable  to  any  of  the 
works  described  in  Section  2.2  since  we  used  a 
data  set  with  a  much  larger  vocabulary  size. 


Information  used 

Word 

level 

Character 

level 

acoustic  -1-  morphological 

-1-  contextual 

(tagger  prob.  weight=5) 

27.3 

13.24 

acoustic  -1-  morphological 

-1-  contextual 

(tagger  prob.  weight=l) 

27.3 

11.54 

acoustic  -1-  morphological 
(tagger  prob.  weight=0) 

27.3 

11.94 

acoustic  only 

50.0 

23.08 

Table  2:  Automatic  diacritization  error  rates 
(%)• 

5  ASR  Experiments 

Our  overall  goal  is  to  use  large  amounts  of  MSA 
acoustic  data  to  enrich  training  material  for  a 
speech  recognizer  for  conversational  Egyptian 
Arabic.  The  EGA  recognizer  was  trained  on  the 
romanized  transcription  of  the  CallHome  cor¬ 
pus  described  above  and  uses  short  vowel  mod¬ 
els.  In  order  to  be  able  to  use  the  phonetically 
deficient  MSA  transcriptions,  we  first  need  to 
convert  them  to  a  diacritized  form.  In  addition 
to  measuring  autodiacritization  error  rates,  as 
above,  we  would  like  to  evaluate  the  different 
diacritization  procedures  by  investigating  how 
acoustic  models  trained  on  the  different  outputs 
affect  ASR  performance. 

One  motivation  for  using  cross-dialectal  data 
is  the  assumption  that  infrequent  triphones  in 
the  CallHome  corpus  might  have  more  training 
samples  in  the  larger  MSA  corpus.  In  (Kirch- 
hoff  and  Vergyri,  2004)  we  demonstrated  that 
it  is  possible  to  get  a  small  improvement  in  this 
task  by  combining  the  scores  of  models  trained 
strictly  on  CallHome  (CH)  with  models  trained 
on  the  combined  EBIS-I-CH  data,  where  the 
FBIS  data  was  diacritized  using  the  method  de¬ 
scribed  in  Section  4.1.  Here  we  compare  that  ex¬ 
periment  with  the  experiments  where  the  meth¬ 
ods  described  in  Sections  4.2  and  4.3  were  used 
for  diacritizing  the  FBIS  corpus. 

5.1  Baseline  System 

The  baseline  system  was  trained  with  only 
CallHome  data  (CH-only).  For  these  exper¬ 
iments  we  used  a  single  front-end  (13  mel- 
frequency  cepstral  coefficients  with  first  and 
second  differences).  Mean  and  variance  as 
well  as  Vocal  Tract  Length  (VTL)  normaliza¬ 
tion  were  performed  per  conversation  side  for 
CH  and  per  speaker  cluster  (obtained  auto¬ 
matically)  for  FBIS.  We  trained  non-crossword. 


System 

dev96 

eval03 

simple  CH-only 

56.1 

42.7 

RT-2003  CH-only 

52.6 

39.7 

Table  3:  CH-only  baseline  WER  (%) 


continuous-density,  genonic  hidden  Markov 
models  (HMMs)  (Digalakis  and  Murveit,  1994), 
with  128  gaussians  per  genone  and  250  genones. 
Recognition  was  done  by  SRI’s  DECIPHER^^ 
engine  in  a  multipass  approach:  in  the  first 
pass,  phone-loop  adaptation  with  two  Max¬ 
imum  Likelihood  Linear  Regression  (MLLR) 
transforms  was  applied.  A  recognition  lexicon 
with  18K  words  and  a  bigram  language  model 
were  used  to  generate  the  first  pass  recogni¬ 
tion  hypothesis.  In  the  second  pass  the  acoustic 
models  were  adapted  using  constrained  MLLR 
(with  6  transformations)  based  on  the  previ¬ 
ous  hypothesis.  Bigram  lattices  were  generated 
and  then  expanded  using  a  trigram  language 
model.  Einally,  N-best  lists  were  generated  us¬ 
ing  the  adapted  models  and  the  trigram  lattices. 
The  final  best  hypothesis  was  obtained  using  N- 
best  ROVER  (?).  This  system  is  simpler  than 
our  best  current  recognition  system  (submitted 
for  the  NIST  RT-2003  benchmark  evaluations) 
(Stolcke  et  ah,  2003)  since  we  used  a  single  front 
end  (instead  of  a  combination  of  systems  based 
on  different  front  ends)  and  did  not  include 
HLDA,  cross-word  triphones,  MMIE  training 
or  a  more  complex  language  model.  The  lack 
of  these  features  resulted  in  a  higher  error  rate 
but  our  goal  here  was  to  explore  exclusively  the 
effect  of  the  additional  MSA  training  data  us¬ 
ing  different  diacritization  approaches.  Table  3 
shows  the  word  error  rates  of  the  system  used 
for  these  experiments  and  the  full  system  used 
for  the  NIST  RT-03  evaluations.  Our  full  sys¬ 
tem  was  about  2%  absolute  worse  than  the  best 
system  submitted  for  that  task.  This  shows  that 
even  though  the  system  is  simpler  we  are  not 
operating  far  from  the  state-of-the-art  perfor¬ 
mance  for  this  task. 

5.2  ASR  Systems  Using  FBIS  Data 

In  order  to  investigate  the  effect  of  additional 
MSA  training  data,  we  trained  a  system  similar 
to  the  baseline  but  used  training  data  pooled 
from  both  corpora  (CH-I-EBIS).  After  perform¬ 
ing  alignment  of  the  EBIS  data  with  the  net¬ 
works  described  in  Section  4.1,  10%  of  the  data 
was  discarded  since  no  alignments  could  be 
found.  This  could  be  due  to  segmentation  prob¬ 


lems  or  noise  in  the  acoustic  files.  The  remain¬ 
ing  90%  were  used  for  our  experiments.  In  or¬ 
der  to  account  for  the  fact  that  we  had  much 
more  data,  and  also  more  dissimilar  data,  we 
increased  the  model  size  to  300  genones. 

Eor  training  the  CH-I-EBIS  acoustic  models, 
we  first  used  the  whole  data  set  with  weight 
2  for  CH  utterances  and  1  for  EBIS  utterances. 
Models  were  then  MAP  adapted  on  the  CH-only 
data  (Digalakis  et  ah,  1995).  Since  training  in¬ 
volves  several  EM  iterations,  we  did  not  want 
to  keep  the  diacritization  fixed  from  the  first 
pass,  which  used  CH-only  models.  At  every  it¬ 
eration,  we  obtain  better  acoustic  models  which 
can  be  used  to  re- align  the  data.  Thus,  for  the 
first  two  approaches,  where  the  size  of  the  pro¬ 
nunciation  networks  is  limited  due  to  the  use 
of  morphological  information,  the  EM  forward- 
backward  counts  were  collected  using  the  whole 
diacritization  network  and  the  best  diacritiza¬ 
tion  path  was  allowed  to  change  at  every  iter¬ 
ation.  In  the  last  case,  where  only  acoustic  in¬ 
formation  was  used,  the  pronunciation  networks 
were  too  large  to  be  run  efficiently.  Eor  this  rea¬ 
son,  we  updated  the  diacritized  references  once 
during  training  by  realigning  the  networks  with 
the  newer  models  after  the  first  training  iter¬ 
ation.  As  reported  in  (Kirchhoff  and  Vergyri, 
2004)  the  CH-I-EBIS  trained  system  by  itself  did 
not  improve  much  over  the  baseline  (we  only 
found  a  small  improvement  on  the  eval03  test- 
set)  but  it  provided  sufficiently  different  infor¬ 
mation,  so  that  ROVER  combination  (Eiscus, 
1997)  with  the  baseline  yielded  an  improvement. 
As  we  can  see  in  Table  4,  all  diacritization  pro¬ 
cedures  performed  practically  the  same:  there 
was  no  significant  difference  in  the  word  error 
rates  obtained  after  the  combination  with  the 
CH-only  baseline.  This  suggests  that  we  may 
be  able  to  obtain  improvements  with  automat¬ 
ically  diacritized  data  even  when  using  inaccu¬ 
rate  diacritization,  produced  without  the  use  of 
morphological  constraints. 

6  Conclusions 

In  this  study  we  have  investigated  different  op¬ 
tions  for  automatically  diacritizing  Arabic  text 
for  use  in  acoustic  model  training  for  ASR.  A 
comparison  of  the  different  approaches  showed 
that  more  linguistic  information  (morphology 
and  syntactic  context)  in  combination  with 
the  acoustics  provides  lower  diacritization  er¬ 
ror  rates.  However,  there  is  no  significant  dif¬ 
ference  among  the  word  error  rates  of  ASR  sys- 


dev96 

eval03 

System 

alone 

Rover  with  GH-only 

alone 

Rover  with  GH-only 

GH-only 

56.1 

42.7 

GH+FBISl  (weight  1) 

56.3 

55.3 

42.2 

41.6 

GH+FBISl  (weight  5) 

56.1 

55.2 

42.2 

41.8 

GH-FFBIS2 

56.2 

55.3 

42.4 

41.6 

GH+FBIS3 

56.6 

55.7 

42.1 

41.6 

Table  4:  Word  error  rates  (%)  obtained  after  the  final  recognition  pass  and  with  ROVER  combina¬ 
tion  with  the  baseline  system.  FBISl,  FBIS2  and  FBIS3  correspond  to  the  diacritization  procedures 
described  in  Sections  4.1,  4.2  and  4.3  respectively.  For  the  first  approach  we  report  results  using 
the  tagger  probabilities  with  weights  1  and  5. 


terns  trained  on  data  resulting  from  the  different 
methods.  This  result  suggests  that  it  is  pos¬ 
sible  to  use  automatically  diacritized  training 
data  for  acoustic  modeling,  even  if  the  data  has 
a  comparatively  high  diacritization  error  rate 
(23%  in  our  case).  Note,  however,  that  one 
reason  for  this  may  be  that  the  acoustic  mod¬ 
els  are  finally  adapted  to  the  accurately  tran¬ 
scribed  CH-only  data.  In  the  future,  we  plan  to 
apply  knowledge-poor  diacritization  procedures 
to  other  dialects  of  Arabic,  for  which  morpho¬ 
logical  analyzers  do  not  exist. 
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